{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b12288f-89fe-4163-933e-bb97a0f80def",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -r requirements.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b6172d4-6444-4a86-83dc-4e9448bcb745",
   "metadata": {},
   "source": [
    "### Getting Started\n",
    "For embedding, we'll be using a model from [SentenceTransformers](https://sbert.net/), which allows us to use a fast and light version of BERT without the heavy compute overhead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "11d4a9d1-4c43-44e2-9952-2cf6459141e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import logging\n",
    "import json\n",
    "from getpass import getpass\n",
    "\n",
    "from elasticsearch import Elasticsearch, helpers\n",
    "from sentence_transformers import SentenceTransformer\n",
    "\n",
    "model = SentenceTransformer(\"sentence-transformers/msmarco-MiniLM-L-12-v3\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "d47ca27e-2d83-41af-aea3-541e169cd540",
   "metadata": {},
   "source": [
    "Make sure you have access to your [Elastic Cloud ID](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id) and [Elastic API Key](https://www.elastic.co/guide/en/cloud/current/ec-api-authentication.html#ec-api-keys). The next code snippet will prompt you for both and connect to Elasticsearch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7abedbcb-f18b-4f82-83b1-e271ae17a8ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "cloud_endpoint = getpass(\"Elastic deployment Cloud Endpoint: \")\n",
    "cloud_api_key = getpass(\"Elastic deployment API Key: \")\n",
    "INDEX_NAME = \"books-local\"\n",
    "\n",
    "es = Elasticsearch(\n",
    "    hosts=cloud_endpoint,\n",
    "    api_key=cloud_api_key,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a54fa50a-2976-4d70-a447-1f742045d2ee",
   "metadata": {},
   "source": [
    "## 1. Embedding text into vectors\n",
    "Our book documents are located in `../data/books.json`. They do not currently have any vectors. The next section will parse through a small batch of 25 book objects and create vector embeddings for each book_description. If you would like to run this on all 10,909 book objects, change the file_path to `../data/books.json` instead of `../data/small_books.json`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "9eb0c477-b935-4930-88a9-9a3fb8b1fa61",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Open the small books json file\n",
    "file_path = \"../data/small_books.json\"\n",
    "with open(file_path, \"r\") as file:\n",
    "    books = json.load(file)\n",
    "\n",
    "# pull out all book_description fields\n",
    "book_descriptions = [book[\"book_description\"] for book in books]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56796f30-369b-4e6d-8bd8-98ebac530f33",
   "metadata": {},
   "outputs": [],
   "source": [
    "# convert the array of \"book_description\" text to vectors\n",
    "pool = model.start_multi_process_pool()\n",
    "embedded_books = model.encode_multi_process(book_descriptions, pool)\n",
    "model.stop_multi_process_pool(pool)\n",
    "print(f\"{len(embedded_books)} vectors created!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf25dec8-5620-4ba8-a0dd-b4db721636bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# add the new vectors bacvk into each book object under the new field \"description_embedding\"\n",
    "for i, book in enumerate(books):\n",
    "    book[\"description_embedding\"] = embedded_books[i].tolist()\n",
    "\n",
    "print(f\"Embeddings added to {len(books)} books.\")\n",
    "\n",
    "\n",
    "# write the embedded books to a new json array of documents\n",
    "output_file = \"../data/small_books_embedded.json\"\n",
    "with open(output_file, \"w\") as file:\n",
    "    file.write(\"[\")\n",
    "    for i, book in enumerate(books):\n",
    "        json.dump(book, file)\n",
    "        if i != len(books) - 1:\n",
    "            file.write(\",\")\n",
    "    file.write(\"]\")\n",
    "print(f\"{len(books)} embedded books saved to file: {output_file}.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c93449e-4192-40b4-b5ec-630e33aa2097",
   "metadata": {},
   "source": [
    "We can inspect the file and observe that there are now vector embeddings added to each document. Let's take a look at the first book in the `small_books_embedded.json` file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f63ebf93-ac58-4559-9326-4ccfe8365e48",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pprint\n",
    "\n",
    "pp = pprint.PrettyPrinter(indent=2)\n",
    "\n",
    "with open(output_file, \"r\") as file:\n",
    "    books = json.load(file)\n",
    "    pp.pprint(books[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d6fd772-411b-4f21-ab2f-89027014ba24",
   "metadata": {},
   "source": [
    "Now lets create an index for our books using the Elasticsearch client."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1bdbb18d-c804-48d4-b278-0b505eeba104",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Lets define our mappings\n",
    "mappings = {\n",
    "    \"mappings\": {\n",
    "        \"properties\": {\n",
    "            \"book_title\": {\"type\": \"text\"},\n",
    "            \"author_name\": {\"type\": \"text\"},\n",
    "            \"rating_score\": {\"type\": \"float\"},\n",
    "            \"rating_votes\": {\"type\": \"integer\"},\n",
    "            \"review_number\": {\"type\": \"integer\"},\n",
    "            \"book_description\": {\"type\": \"text\"},\n",
    "            \"genres\": {\"type\": \"keyword\"},\n",
    "            \"year_published\": {\"type\": \"integer\"},\n",
    "            \"url\": {\"type\": \"text\"},\n",
    "            \"description_embedding\": {\"type\": \"dense_vector\", \"dims\": 384},\n",
    "        }\n",
    "    }\n",
    "}\n",
    "\n",
    "# Delete any previous index\n",
    "es.indices.delete(index=INDEX_NAME, ignore_unavailable=True)\n",
    "es.indices.create(index=INDEX_NAME, body=mappings)\n",
    "print(f\"Index '{INDEX_NAME}' created.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "601d31c4-617b-4d5a-9352-6ff1f759cae7",
   "metadata": {},
   "source": [
    "Now that we have created an index in Elasticsearch, we can index our local book objects. This `bulk_ingest_books` method will make indexing documents much faster than if we were to run an index function on each individual book."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e1dfa21-a500-4168-b920-ca0eb128be27",
   "metadata": {},
   "outputs": [],
   "source": [
    "file_path = \"../data/small_books_embedded.json\"\n",
    "with open(file_path, \"r\") as file:\n",
    "    books = json.load(file)\n",
    "\n",
    "# create an array of index actions, with each element holding one document\n",
    "actions = [\n",
    "    {\"_index\": INDEX_NAME, \"_id\": book.get(\"id\", None), \"_source\": book}\n",
    "    for book in books\n",
    "]\n",
    "\n",
    "try:\n",
    "    helpers.bulk(es, actions, chunk_size=1000)\n",
    "    print(f\"Successfully added {len(actions)} books into the '{INDEX_NAME}' index.\")\n",
    "\n",
    "except helpers.BulkIndexError as e:\n",
    "    print(f\"Error occurred while ingesting books: {e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "205538b7-fa8a-41b9-8381-7028759891ea",
   "metadata": {},
   "source": [
    "Lets also add one book, as this would be a standard function as you add new books to your vector database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "c00f8d4d-a2b8-4f79-a3cf-cef532aedb95",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"../data/one_book_embedded.json\", \"r\") as file:\n",
    "    book = json.load(file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c221dffe-e3ef-4467-876b-d4e01cac1b7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    resp = es.index(\n",
    "        index=INDEX_NAME,\n",
    "        id=book.get(\"id\", None),\n",
    "        body=book,\n",
    "    )\n",
    "\n",
    "    print(f\"Successfully indexed book! Result: {resp.get('result', None)}\")\n",
    "\n",
    "except Exception as e:\n",
    "    print(f\"Error occurred while indexing book: {e}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5299a667-ebea-4d0a-820d-c46c70d8031f",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
